perf(serve): bound JVM/Neo4j memory and dedupe topology snapshot #118

Merged: aksOps merged 1 commit into main from perf/serve-oom-quickwin on May 4, 2026

@aksOps (Contributor) commented May 4, 2026

Summary

OOM review of codeiq serve on AKS at the typical ~200 K-node graph
scale identified four cumulative offenders fighting for the same
cgroup memory limit. This PR addresses all four:

  • Topology snapshot deduplication. McpTools and TopologyController
    each held an independent in-heap topology snapshot. Extracted a single
    query/TopologySnapshotProvider (60 s TTL, releasable when idle) shared
    by both. The Snapshot record carries a loaded flag so the controller
    can still distinguish "no source available" (404) from "graph is
    empty" (200), preserving the legacy contract.
  • Spring cache. @EnableCaching was on but no CacheManager bean was
    registered, so Spring fell back to an unbounded
    ConcurrentMapCacheManager. Switched the serving profile to Caffeine
    (maximumSize=1000, expireAfterWrite=5m).
  • Neo4j page cache. Capped at 256 MiB via
    GraphDatabaseSettings.pagecache_memory so embedded Neo4j stops
    auto-grabbing ~50 % of free RAM at startup.
  • AKS JVM flags. Added -XX:MaxRAMPercentage=50
    -XX:InitialRAMPercentage=25 -XX:+UseG1GC -XX:+ExitOnOutOfMemoryError
    to scripts/aks-launch.sh so the heap is capped at half the cgroup
    limit (starting at a quarter), leaving room for Neo4j + Metaspace +
    JIT + Tomcat NIO + OS slack.

Plus a runbook at shared/runbooks/aks-oom-quick-fix.md with the
diagnostic flow (OOMKilled vs readiness-flap) and the Deployment YAML
patch.
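
For reviewers who want a picture of the shared provider, here is a minimal sketch of a TTL-bounded snapshot holder with a `loaded` flag. It is an illustration under assumptions, not the code in this PR: the `List<String>` node payload, the `Optional`-returning loader, the injected clock, and the `evictIfExpired`, `isHoldingSnapshot`, and `statusFor` helpers are all hypothetical names chosen for the example.

```java
import java.time.Duration;
import java.util.List;
import java.util.Optional;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical sketch of a single, shared, TTL-bounded topology snapshot holder. */
final class TopologySnapshotProvider {

    /** loaded=false means "no source available"; loaded=true with no nodes means "graph is empty". */
    record Snapshot(boolean loaded, List<String> nodes) {
        static Snapshot unavailable() {
            return new Snapshot(false, List.of());
        }
    }

    // Assumption: the loader returns Optional.empty() when there is no graph source at all.
    private final Supplier<Optional<List<String>>> loader;
    private final LongSupplier nanoClock; // injected so tests can control time
    private final long ttlNanos;

    private Snapshot cached;     // guarded by "this"
    private long loadedAtNanos;  // guarded by "this"

    TopologySnapshotProvider(Supplier<Optional<List<String>>> loader,
                             Duration ttl,
                             LongSupplier nanoClock) {
        this.loader = loader;
        this.ttlNanos = ttl.toNanos();
        this.nanoClock = nanoClock;
    }

    /** Returns the cached snapshot, reloading at most once per TTL window. */
    synchronized Snapshot get() {
        long now = nanoClock.getAsLong();
        if (cached == null || now - loadedAtNanos >= ttlNanos) {
            cached = loader.get()
                    .map(nodes -> new Snapshot(true, List.copyOf(nodes)))
                    .orElseGet(Snapshot::unavailable);
            loadedAtNanos = now;
        }
        return cached;
    }

    /** Drops an expired snapshot so an idle process can release the heap (call from a scheduled task). */
    synchronized void evictIfExpired() {
        if (cached != null && nanoClock.getAsLong() - loadedAtNanos >= ttlNanos) {
            cached = null;
        }
    }

    synchronized boolean isHoldingSnapshot() {
        return cached != null;
    }

    /** Maps the loaded flag to the legacy HTTP contract: 404 for no source, 200 otherwise. */
    static int statusFor(Snapshot snapshot) {
        return snapshot.loaded() ? 200 : 404;
    }
}
```

Because both consumers would call `get()` on the same instance, at most one snapshot is on heap at a time, and concurrent callers inside the TTL window share one load.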

Net effect at 200 K nodes / 4 GiB pod: peak heap ceiling drops ~50 %,
no more OOMKilled events, idle pod releases topology snapshot after
60 s.
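
To make the 4 GiB budget concrete, the arithmetic behind the flags can be sketched as follows. This is a back-of-the-envelope calculation, not a measurement: the `PodMemoryBudget` class and its method names are hypothetical, and the real JVM may round heap sizes slightly for alignment.

```java
/** Hypothetical back-of-the-envelope memory budget for a pod running the JVM plus embedded Neo4j. */
final class PodMemoryBudget {
    static final long MIB = 1024L * 1024L;

    /** Approximate heap size the JVM derives from a RAM percentage flag and the container limit. */
    static long heapBytes(long containerLimitBytes, double ramPercentage) {
        return (long) (containerLimitBytes * ramPercentage / 100.0);
    }

    /** What remains for Metaspace, JIT code cache, thread stacks, NIO buffers, and OS slack. */
    static long remainingBytes(long containerLimitBytes, long maxHeapBytes, long pageCacheBytes) {
        return containerLimitBytes - maxHeapBytes - pageCacheBytes;
    }

    public static void main(String[] args) {
        long limit = 4096 * MIB;                  // limits.memory: 4Gi
        long maxHeap = heapBytes(limit, 50.0);    // -XX:MaxRAMPercentage=50
        long initHeap = heapBytes(limit, 25.0);   // -XX:InitialRAMPercentage=25
        long pageCache = 256 * MIB;               // Neo4j page cache cap
        long rest = remainingBytes(limit, maxHeap, pageCache);

        // Prints: max heap 2048 MiB, initial heap 1024 MiB, page cache 256 MiB, remaining 1792 MiB
        System.out.printf("max heap %d MiB, initial heap %d MiB, page cache %d MiB, remaining %d MiB%n",
                maxHeap / MIB, initHeap / MIB, pageCache / MIB, rest / MIB);
    }
}
```

Inside a running pod, `Runtime.getRuntime().maxMemory()` reports the heap ceiling the JVM actually chose, which is a quick way to confirm the flags took effect.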

Test plan

  • `mvn test -Dfrontend.skip=true` — 3706 / 3706 pass, 32 skipped (expected)
  • `mvn package -DskipTests -Dfrontend.skip=true` — clean build
  • Deploy to AKS staging, verify pod stays under `limits.memory: 4Gi`
  • Verify topology MCP tools (get_topology, blast_radius, find_path) still respond correctly under bearer auth

🤖 Generated with Claude Code

OOM review of `codeiq serve` on AKS at the typical ~200 K-node graph
scale identified four cumulative offenders fighting for the same cgroup
memory limit:

- McpTools and TopologyController each held an independent in-heap
  topology snapshot (~150 MB at this graph size). Under mixed REST + MCP
  traffic both lived on heap simultaneously.
- TopologyController's snapshot had no TTL — once loaded, held for the
  lifetime of the process.
- Spring `@EnableCaching` was on but no `CacheManager` bean was
  registered, so every `@Cacheable` region in QueryService fell back to
  ConcurrentMapCacheManager (unbounded, no TTL, no eviction).
- Neo4j embedded auto-grabbed ~50% of free RAM for its off-heap page
  cache at startup, racing the JVM heap inside a single cgroup.

Changes:

- Extract `query/TopologySnapshotProvider` as the single owner of the
  topology snapshot; both McpTools and TopologyController now consume
  it. 60 s TTL deduplicates concurrent loads and lets idle pods release
  the heap. The Snapshot record carries a `loaded` flag so the
  controller can still distinguish "no source available" (404) from
  "graph is empty" (200), preserving the legacy contract.
- Switch `cache.type: simple` → `caffeine` with
  `maximumSize=1000, expireAfterWrite=5m` in the serving profile; add
  the Caffeine dependency.
- Cap Neo4j page cache at 256 MiB via
  `GraphDatabaseSettings.pagecache_memory` in Neo4jConfig.
- Add `-XX:MaxRAMPercentage=50 -XX:InitialRAMPercentage=25
  -XX:+UseG1GC -XX:+ExitOnOutOfMemoryError` to scripts/aks-launch.sh
  so the JVM heap is capped at half the cgroup limit, leaving room for
  Neo4j page cache + Metaspace + JIT + Tomcat NIO buffers + OS slack.
- Add `shared/runbooks/aks-oom-quick-fix.md` with diagnostic commands,
  the Deployment YAML patch, and the OOMKilled-vs-readiness-flap
  decision tree.

Net effect at 200 K nodes / 4 GiB pod: peak heap ceiling drops ~50 %,
no more OOMKilled events, idle pod releases topology snapshot after
60 s. All 3706 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@aksOps aksOps enabled auto-merge (squash) May 4, 2026 17:14
@aksOps aksOps merged commit d6e34ea into main May 4, 2026
13 checks passed
@aksOps aksOps deleted the perf/serve-oom-quickwin branch May 4, 2026 17:17